Georeferencing Annual Household Surveys in Rural Madagascar: A Comprehensive Analysis of Observatories (1995-2014)
Author
Florent Bédécarrats
Abstract
This technical appendix accompanies a forthcoming data paper. It focuses on the georeferencing process of household panel data collected by the Rural Observatory Network (ROR) from 1995 to 2014. The ROR captured information about the diverse rural contexts in Madagascar. While the quality data gathered was carefully monitored, its geolocation attributes were not fully curated, restricting its utility in spatially integrated research or public policy evaluation. This appendix systematically aligns the varying spatial denominations found in the ROR dataset with a standardized geographic database, to provide a consistent correlation between each ROR observation and corresponding geospatial entities, namely communes and fokontany. We do so leveraging computational notebooks, specifically the Quarto format, utilizing the R programming language. This interactive setup fosters a blend of code, outcomes, explanations, and multimedia resources, ensuring that the ROR data becomes more dependable and user-friendly for spatial analysts and other researchers.
Note: To pilot-test our georeferencing approach, we tested it with 8 observatories (out of the 26 of the rural observatory network). We prioritized the observatories that were in the vicinity of protected areas, as it exemplifies the usefulness of our approach. This procedure will soon be extended to the 18 other observatories.
Introduction
This document serves as a technical appendix accompanying the paper tied to opening of the data collected by the Rural observatory network (ROR, for its acronym in French) from 1995 to 2014. The ROR was a panel survey system that monitored the living conditions of populations in selected localities representing the diversity of rural contexts in Madagascar. While the ROR data underwent a rigorous supervision and curation process during its collection phase, complemented by an exhaustive harmonization effort subsequently, its geolocation attributes remain largely incomplete. This absence of usable geographic reference in the data hinders its integration with other spatially-referenced datasets, particularly when one contemplates its utility in realms such as public policy formulation or project evaluation.
This study endeavors to bridge this gap through a meticulous georeferencing process. The primary objective is to reconcile the somewhat erratic spatial denominations, or toponyms, found within the ROR dataset with a more standardized spatial database. Ultimately, this effort aims to establish a congruent correspondence between each observation from the ROR survey and both a polygon representing a Malagasy municipality and a referential village point.
To achieve this in a reliable way that enables external verification and subsequent improvements, we harness the capabilities of computational notebooks (Quarto format). As interactive computing environments, these notebooks facilitate an amalgamation of code, computational output, explanatory text, and multimedia resources. We use R for the code, as it offeres a high versatility to handle spatial data, complex surveys and textual informaiton. By adopting this methodology, we aim to bolster the reliability and usability of the ROR data for spatial analysts and researchers.
Code
library(tidyverse) # A series of packages for data manipulationlibrary(haven) # Required for reading STATA files (.dta)library(labelled) # To work with labelled data from STATAlibrary(sf) # for spatial data handlinglibrary(tidyverse) # for data wrangling and visualizationlibrary(stringdist) # for string distance and matchinglibrary(tmap) # for mappinglibrary(fuzzyjoin) # for fuzzy joininglibrary(readxl) # Read data frames to Excel formatlibrary(writexl) # Write data frames to Excel formatlibrary(gt) # for nicely formatted tableslibrary(cowplot) # to combine plotslibrary(gtsummary) # to produce nice summary tableslibrary(janitor) # to simply add rowsums
Revisiting the Rural Observatories (ROR) of Madagascar
The Rural Observatory Network (ROR) was established as part of the Madio project (1994-1999), a technical assistance program of the IRD to the national statistical institute (INSTAT) of Madagascar, as an innovative mechanism for monitoring and analyzing the country’s agricultural sector. This initiative arose from the urgent need to compile statistics on the agricultural industry due to the breakdown of the existing statistical system following the socialist era and political turmoil in the country.
Four initial observatories were established in 1995, each representing a different set of agricultural challenges in the country’s diverse ecosystems. From 1995 to 2015, 22 other observatories were created, and existing were abandoned (see Figure 1). The ROR were supported by a variety of donors and NGOs during this period: European union, AFD, World Bank, GTZ, Swiss cooperation, Care and Rio Tinto. The selection of observation sites was based on a combination of criteria, including agroclimatic areas, dominant production systems, population density, accessibility, and the presence of support structures and development projects. These observatories aimed to provide insights into the dynamics of Malagasy agriculture, covering a range of issues such as isolated coastal communities of fishermen and livestock farmers, the impact of trade liberalization on vanilla producers, family smallholdings in the central plateau, and the restructuring of large irrigated rice-growing areas.
Code
# Define a function to load and count surveys per observatory for a given yearload_and_count <-function(year, factorize =FALSE) {# Define file path file_path <-paste0("Données ROR/enter/", year, "/res_deb.dta")# Load data data <-read_dta(file_path)# Extract label and convert to factors if optionif (factorize) { data <- data %>%mutate(across(everything(), as.character),across(where(is.labelled), ~as.character(as_factor(.)))) }# Count surveys per observatory count_data <- data %>%group_by(j0) %>%summarise(survey_count =n()) %>%ungroup() %>%mutate(year = year) # Add year columnreturn(count_data)}# Generate a list of yearsyears <-1995:2014# Use purrr::map_df to loop through each year and bind resultsobs_count <-map_df(years, load_and_count) %>%# Remove rows with observatory "7 " and "NA", whch are errorsfilter(j0 !=7&!is.na(j0)) %>%rename(observatory = j0)# Read observatory namesobservatory_names <- readxl::read_xlsx("Observatory_names.xlsx") %>%select(code, observatory_name = name)# PAss it to wide.obs_count <- obs_count %>%left_join(observatory_names, by =c("observatory"="code")) %>%group_by(observatory_name, year) %>%summarise(survey_count =sum(survey_count))obs_count_wide <- obs_count %>%pivot_wider(names_from = year, values_from = survey_count)# Add observatory approximate locationlocations <-tibble(code =c(1, 2, 3, 4, 12, 13, 15, 16, 21, 22, 23, 24, 31, 25, 41, 42, 43, 51, 44, 45, 61, 17, 18, 19, 71, 52),name =c("Antalaha", "Antsirabe", "Marovoay", "Toliara", "Antsohihy", "Tsiroanomandidy-Bongo", "Farafangana", "Ambovombe", "Alaotra", "Manjakandriana", "Toliara littoral", "Fenerive Est", "Bekily", "Mahanoro", "Soaviandriana-Itasy", "Menabe-Belo", "Fianarantsoa", "Tsivory", "Morondava", "Manandriana", "Tanandava", "Ihosy", "Ambohimahasoa", "Manakara", "Tolanaro", "Menabe-Nord-Est"),latitude =c(-14.8833, -19.8659, -16.1000, -23.3558, -14.8796, -18.7713, -22.8167, -25.1667, -17.8319, -18.9167, -23.3558, -17.3500, -24.6900, -19.9000, -19.1686, -19.6975, -21.4527, -24.4667, -20.2833, -20.2333, -22.5711, -22.4000, -20.7145, -22.1333, -25.0381, -20.5486),longitude =c(50.2833, 47.0333, 46.6333, 43.6683, 47.9875, 46.0546, 47.8333, 46.0833, 48.4167, 47.8000, 43.6683, 49.4167, 45.1700, 48.8000,46.7354, 44.5419, 47.0857, 45.4667, 44.2833, 47.3833, 45.0439, 46.1167, 47.0389, 48.0167, 46.9562, 47.1597))obs_count <-left_join(obs_count, locations, by =c("observatory_name"="name"))madagascar <-st_read("../Entités administratives/OCHA_BNGRC frontières administratives/mdg_admbnda_adm0_BNGRC_OCHA_20181031.shp",quiet =TRUE)# Sort locations by latitude to generate sequence numberslocations <- locations %>%arrange(desc(latitude)) %>%mutate(seq_num =1:n())# Create map plot with labelsmap_plot <-ggplot(data = madagascar) +geom_sf(fill ="lightgray") +geom_point(data = locations, aes(x = longitude, y = latitude, color = name), size =3) +geom_text(data = locations, aes(x = longitude, y = latitude, label = seq_num), vjust =-1, hjust =1, size =3) +theme_void() +theme(legend.position ="none")# Add sequence numbers to observatory names in obs_count dataframeobs_count <- obs_count %>%left_join(locations %>%select(name, seq_num), by =c("observatory_name"="name")) %>%mutate(observatory_with_num =paste0(seq_num, ". ", observatory_name))# Create timeline plot using modified obs_count with observatory_with_numtimeline_plot <-ggplot(obs_count, aes(x = year, y =fct_reorder(observatory_with_num, latitude), color = observatory_name)) +geom_point(aes(size = survey_count), show.legend = F) +theme_minimal() +labs(y =NULL, x =NULL) +theme(axis.text.y =element_text(size =8),legend.position ="none")# Stitch the plots togethercombined_plot <-plot_grid(map_plot, timeline_plot, rel_widths =c(1.3, 2))ggsave("../../Protocole/Figures/ROR_history.png", plot = combined_plot, width =10, height =7, dpi =300)print(combined_plot)
Figure 1: Coarse location of rural observatory and survey years
This set-up was particularly significant for rural development and macroeconomic policy decisions. In the mid-1990s, This initiative was motivated by an urgent need to compile statistics on the agricultural industry due to the breakdown of the existing statistical system. This became crucial as Madagascar experienced significant changes in its rural areas since 1980, particularly with the liberalization of pricing and marketing structures. The Madagascar ROR arose as a part of the broader trend of socio-economic observatories that emerged in sub-Saharan Africa in the 1980s. These observatories are information systems designed to assess changes in society and their impact on specific populations, emphasizing poverty reduction, combating exclusion, and promoting sustainable human development.
Challenges for opening the data
In this notebook, our primary objective is to enhance the georeferencing of the ROR survey data for open data sharing. The initial ROR survey, initiated in 1995, recorded geographical information in varying formats: from “village” to a combination of “municipality”, “village”, and “site”. A significant challenge arose from the fact that while data collection started in 1995, municipalities were only formally established in 1994, with several years required for stabilization. The inherent fluidity in toponyms, predominantly derived from oral traditions, resulted in varied written representations. Our endeavor is to identify, disambiguate, and georeference observations recorded in the ROR data, adopting the Common Operational Datasets (CODs) as a reference, which has been collaboratively defined by OCHA and Madagascar’s BNGRC (National Disaster Management Office).
CODs stand as the bedrock for all preparedness and response operations, especially within the humanitarian sector. Adopted by the IASC in 2008 and revised in 2010, these datasets are pivotal for facilitating informed decision-making during the critical initial hours of a crisis. By ensuring consistency among stakeholders, they simplify data management and establish a shared operational picture of a crisis. Particularly relevant for our purpose is the incorporation of P-codes in CODs. These unique geographic identification codes, found in both Administrative Boundary CODs (COD-ABs) and Population Statistics CODs (COD-PSs), surmount challenges posed by variations in placenames and spellings. For instance, in Madagascar, 81 different administrative level 4 (ADM4) features are labeled “Morafeno”, with six of these existing within ADM3 features also termed “Morafeno”, distinguishable solely by their unique ADM2 features.
P-codes act as reliable geographic identifiers, eliminating errors arising from identical or variably spelled geographic locations. Leveraging the HDX platform, an open platform for cross-crisis data sharing, we fetch this data to ensure the accurate georeferencing of our ROR data. By harnessing the standardized and official spelling of places provided by P-codes, we can amalgamate, harmonize, and analyze data from diverse sources, offering a comprehensive, geo-accurate view of the survey’s findings.
Data Description and Initial Exploration with R
ROR survey data
The ROR survey data is organized in a collection of year-specfic folders ranging from 1995 to 2015. Each yearly folder houses multiple .dta files (Stata data format) – about 85 per year – with diverse filenames such as “res_as.dta” and “res_bp.dta”. Although there’s an element of harmonization achieved, especially regarding certain variables and household identifiers, the data varies in terms of geographical granularity. Initial years primarily provide a singular field denoting the village name. As the years progress, this evolves to include a municipality name, and in the latter years, an additional “site” name occasionally appears. A comprehensive overview of the observations can be gleaned from the “res_deb.dta” files within each year.
Code
# Function to extract variable info for a given year and fileextract_variable_info <-function(year, file) { file_path <-paste0("Données ROR/enter/", year, "/", file)if (!file.exists(file_path)) return(tibble()) data <-read_dta(file_path, n_max =0)tibble(file_name = file,variable_name =names(data),variable_label =var_label(data) %>%as.character(),year = year)}# Obtain all years from the directory structureyears <-list.dirs("Données ROR/enter/", recursive =FALSE, full.names =FALSE)# Use the tidyverse approach to map over years and filesall_vars <-map_df(years, ~{ files_for_year <-list.files(paste0("Données ROR/enter/", .x), pattern ="\\.dta$", full.names =FALSE)map_df(files_for_year, extract_variable_info, year = .x)})# Convert any NULL values in variable_label to "NA"all_vars$variable_label[is.na(all_vars$variable_label)] <-"NA"# Consolidate the information using the tidyverse approachvariable_dictionary <- all_vars %>%group_by(file_name, variable_name) %>%arrange(year) %>%summarise(variable_label =first(variable_label[variable_label !="NA"] %||%"NA"),years_present =list(unique(year))) %>%ungroup() %>%mutate(years_present =map_chr(years_present, ~paste(.x, collapse =",")))# Write the variable dictionary to an Excel filewrite_xlsx(variable_dictionary, "ROR_Variable_Dictionary.xlsx")
The code block above creates a data dictionnary, which you can download by clicking on this link.
Panel attrition
To assess the integrity of the panel data, we assess the attrition that characterizes it. That is, for each survey reiteration, we compute the percentage of household identified in the previous round that are still present in the subsequent round. This produces the following result:
Code
# Function to read and process each fileread_and_process <-function(year) { file_path <-file.path("Données ROR/enter", as.character(year), "res_deb.dta") data <-read_dta(file_path) %>%select(j0, j5) %>%mutate(year = year)return(data)}# Use map to read and process files, then combine with bind_rowsconsolidated_data <-map_dfr(years, read_and_process) %>%mutate(year =as.numeric(year))# NB : j5 codes have been modified in 1996# so we need to replace the ones from 1995hh_96 <-read_dta("Données ROR/enter/1996/res_deb.dta") %>%select(j0, year, j5_96 = j5, j_1995, j12b) %>%filter(j_1995 ==1) %>%select(-j_1995) %>%mutate(year =1995) %>%distinct(j12b, .keep_all =TRUE)consolidated_data <- consolidated_data %>%left_join(hh_96, by =c("j0", "year", "j5"="j12b")) %>%mutate(j5 =ifelse(year ==1995&!is.na(j5_96), j5_96, j5)) %>%select(j0, j5, year)# We need also to discard the 2004 survey in Marovoay that is very particular# cf. Vaillant 2013.consolidated_data <- consolidated_data %>%filter(!(j0 ==3& year ==2005))# Remove duplicates and create the hh_all tablehh_all <- consolidated_data %>%distinct(j0, j5, year, .keep_all =TRUE) %>%arrange(j0, j5)hh_grouped <- hh_all %>%group_by(j0, year) %>%summarise(j5_list =list(j5), .groups ='drop') %>%# Count the number of j5 in j5_listmutate(j5_count =map_int(j5_list, length)) %>%# Create a column to identify the most recent previous year with data for the same observatorygroup_by(j0) %>%mutate(previous_year =lag(year)) %>%ungroup()# Self-join to create previous_year_j5_listattrition_rates_detail <- hh_grouped %>%left_join(hh_grouped %>%select(j0, year, previous_year_j5_list = j5_list,j5_count_previous_year = j5_count), by =c("j0", "previous_year"="year")) %>%mutate(repeated_j5 =map_int(seq_along(j5_list), ~length(intersect(j5_list[[.]], previous_year_j5_list[[.]]))),attrition_rate = (j5_count_previous_year - repeated_j5) / j5_count_previous_year *100)# Pivot the data to have years as columns and observatory numbers as rowsattrition_rates <- attrition_rates_detail %>%select(j0, year, attrition_rate) %>%left_join(locations %>%mutate(observatory_with_num =paste0(seq_num, ". ", name),observatory_with_num =fct_reorder(observatory_with_num, latitude)) %>%select(code, name, observatory_with_num), by =c("j0"="code")) %>%drop_na(name)average_wo_outliers <- attrition_rates %>%filter(attrition_rate <80) %>%summarise(mean =mean(attrition_rate))average_wo_outliers <-round(average_wo_outliers$mean, 1)compound_avg_10y <-round((1-(1-(average_wo_outliers/100))^10)*100) ggplot(attrition_rates, aes(x = year, y = observatory_with_num, fill = attrition_rate)) +geom_tile() +# Create the heatmap tilesgeom_text(aes(label =ifelse(is.na(attrition_rate), "", round(attrition_rate))), color ="black", size =2.5) +scale_fill_gradient2(low ="darkgreen", mid ="yellow", high ="red", midpoint =30, na.value ="grey", name ="Attrition Rate (%)") +labs(x ="Year",y ="Observatory (j0)") +theme_minimal() +labs(y =NULL, x =NULL) +theme(axis.text.y =element_text(size =8))
Figure 2: Attrition rate of ROR panels per observatory and survey round
Annual attrition rates superior to 80% for a specific observatory are likely to be induced by a change in the identification codes and we hope to be able to solve such issue later on. If we discard these outliers (attrition rates over 80%), we have an average attrition year of % wich is very high, leading to a compound attrition rate of % over 10 years. Attrition on ROR data has been studied Gubert and Robilliard (2008); Vaillant (2013).
Administrative boundaries
The “Madagascar Subnational Administrative Boundaries” dataset is sourced from the Common Operational Datasets (CODs), which offer authoritative reference datasets for decision-making during humanitarian operations. Specifically designed to streamline the discovery and exchange of pivotal data, CODs ensure uniformity and use the ‘best available’ datasets. This particular dataset focuses on administrative boundaries, including gazetteers with P-codes, facilitating organized humanitarian assessments and data management. P-codes act as unique identifiers for every administrative unit and populated area, ensuring standardization in nomenclature. When datasets adhere to the P-code standard, their integration and analysis become efficient. The dataset provides comprehensive boundary information for Madagascar at five administrative levels: country, region, district, commune, and fokontany. It’s accessible for download as shapefiles from the provided link.
Localities
The “Madagascar Populated Places” dataset is also part of the Common Operational Datasets (CODs). This dataset encompasses populated place points for Madagascar. The data has been sourced from the National Geospatial-Intelligence Agency and provided by the University of Georgia - ITOS. Further, the Geographic Information Support Team (GIST) has taken up the role of distributor, with the data being published on 2007-03-07. UN OCHA ROSA has enhanced the dataset by adding P-codes and administrative boundary names, which are based on the BNGRC (National Disaster Management Office) data. The dataset geolocates 28184 populated places with their toponyms (names), codes related to various administrative levels such as fokontany, commune, district, and region, and their spatial coordinates.
The treatment of toponyms presents a unique challenge, especially when these names are captured from varied sources. In the dataset, these names can vary due to differences in languages, case sensitivity, and the inclusion of additional descriptive terms. To address this, the clean_string function was developed. This function begins by converting all strings to lowercase, ensuring that subsequent comparisons are not sensitive to case variations. Next, to create a uniform standard, all non-alphanumeric characters are removed, retaining only spaces and the alphanumeric content. Common qualifiers in toponyms, such as “centre”, “haut” (high) or “bas” (low), which are not used consistently across records, are also removed. Given the bilingual nature of the dataset, with entries potentially in both Malagasy and French, the function translates cardinal points to the Malagasy language to ensure uniformity. Lastly, certain locales with multiple names, such as “Fort Dauphin”, also known as “Taolagnaro”, “Tolagnaro” or “Tolanaro”, are standardized to a single term, “Tolanaro”, to eliminate potential disparities. We also address instances of Roman numerals from I to VI, converting them to their Arabic numeral counterparts, ensuring consistent representation across records.
Code
clean_string <-function(x){ x %>%tolower() %>%# Convert to lowercase# Retain spaces, remove other non-alphanumeric charactersstr_replace_all("[^[:alnum:][:space:]]", " ") %>%str_remove_all("\\b(centre|haut|bas|androy)\\b") %>%str_trim() %>%# Trim spaces from start and end of stringstr_replace_all("\\bcentre\\b", "") %>%# Remove the word 'centre'# Translate cardinal pointsstr_replace_all("\\bnord\\b", "avaratra") %>%str_replace_all("\\best\\b", "atsinanana") %>%str_replace_all("\\bouest\\b", "andrefana") %>%str_replace_all("\\bsud\\b", "atsimo") %>%str_replace_all("\\batsinana\\b", "atsinanana") %>%# Replace short form str_replace_all("(fort dauphin)|(taolagnaro)|(tolagnaro)", "tolanaro") %>%# Variations for fort dauphin# Convert Roman numerals to Arabicstr_replace_all("\\bi\\b", "1") %>%str_replace_all("\\bii\\b", "2") %>%str_replace_all("\\biii\\b", "3") %>%str_replace_all("\\biv\\b", "4") %>%str_replace_all("\\bv\\b", "5") %>%str_replace_all("\\bvi\\b", "6")}
Fuzzy matching
By default, statistical softwares and computing language match text by pairing only identical strings. Exact string matching is inappropriate in our context, where location data entry was subject to human errors like typographical mistakes or minor variations in spelling. To avoid this rigidity, fuzzy matching is employed. This approach gauges the degree of similarity between two strings, bypassing the need for an exact character-to-character match. The principle metric adopted for this is the “Levenshtein distance”, which quantifies the minimum number of single-character edits required to change one string into another. The fuzzy_match function encapsulates this approach. The function initiates the process by filtering the reference list of encontered toponyms based on a given observatory code, which considerably narrows down potential matches. Then, using the Jaro-Winkler distance metric — a variant of the Levenshtein distance particularly suited for shorter strings — the function computes the similarity between the target string and entries in the filtered reference. To ensure that only relevant matches are acknowledged, a threshold, termed max_distance, is set. Matches that exceed this threshold are disregarded. For those that pass this validation, the function then extracts the pertinent details of the matched row from the reference dataframe.
Code
fuzzy_match <-function(target_string, dataframe, column_name, observatory_code, max_distance =0.25) {# Filter the dataframe based on observatory_code filtered_reference <- dataframe %>%filter(OBS_CODE == observatory_code) %>%select(all_of(column_name), ADM3_PCODE, ADM3_EN)# If filtered_reference is empty, return NA valuesif (nrow(filtered_reference) ==0) {return(list(matched_string =NA, ADM3_PCODE =NA, ADM3_EN =NA, distance =1)) }# Use stringdist to find the closest match distances <- stringdist::stringdistmatrix(target_string, filtered_reference[[column_name]], method ="jw")# If there are no valid distances, set min_distance to Inf (this should help avoid the error)if(all(is.na(distances))) { min_distance <-Inf } else { min_distance <-min(distances, na.rm =TRUE) # Ensure NA values don't affect the min calculation }# Check for Inf distance and replace it with 1if (is.infinite(min_distance)) { min_distance <-1 }# If min_distance exceeds the max_distance threshold, return NA valuesif (min_distance > max_distance) {return(list(matched_string =NA, ADM3_PCODE =NA, ADM3_EN =NA, distance =NA)) } matched_row <- filtered_reference[which.min(distances), ]return(list(matched_string = matched_row[[column_name]], ADM3_PCODE = matched_row$ADM3_PCODE, ADM3_EN = matched_row$ADM3_EN, distance = min_distance))}
Hierarchical matching of data
The hierarchical organization of spatial entities is key for our challenge. Such an organization allows for a cascading representation of data, from broader scopes narrowing down to more specific layers. This representation is reminiscent of the defining order of geographical entities: regions contain provinces, which contain municipalities, and these in turn contain localities. In the georeferencing context, leveraging this hierarchical structure can lead to more precise matches. For instance, if an observatory code is associated with a specific district, the search for matches is confined to that district, enhancing both the efficiency and accuracy of the process. Figure 3 represents this hierarchical arrangement, serving as a roadmap for the subsequent data matching tasks.
graph TD
A[Legend]
W[ ]
X[ ]
Y[ ]
Z[ ]
W -->|Pre-definede hierarchy| X
Y -.-|Implemented matching|Z
graph TD
%% Administrative Logic
subgraph "Administrative Logic"
A[Region]
B[District]
C[Commune]
D[Fokontany]
I[Populated places]
A --> B
B --> C
C --> D
D --> I
end
%% Observatory Logic
subgraph "Observatory Logic"
E[Observatory network]
F[Observatory]
G[Commune]
H[Village]
J[Site]
E --> F
F --> G
G --> H
F --> H
H--> J
end
%% Geospatial matching
H -.- I
C -.- G
Figure 3: Spatial entities pre-defined relationship and matching
Madagascar’s current administrative setup is straightforward: regions contain districts; districts have communes; and communes are made up of Fokontany. Though a Fokontany should in principle be a single village, it often includes multiple villages or populated areas. It’s worth noting that while the idea of communes has been around for a while, they were only officially recognized in 1994. However, rolling them out took some time after 1994. Before 1994, “communes” simply described local government areas without any formal administrative status. The regions were created in 2004. On the ROR side, observatory refereed during the first surveys to villages. A systematic registry of the communes only started in the 2004 and 2005 round, depending on the observatories. A mention to “sites” also appeared in 2011 but was scarcely documented. Our strategy was to established links between the village information and populated places, and between communes.
A Detailed walk-through
We now break down our approach to describe each subsequent step. We began by exploring the ROR documentation, and in particular the reports associated to community survey, that contain descriptions of the area surveyed by each observatory. While doing so, we updated the COD subnational administrative boundary dataset by adding a field named OBS_Y_N. This field was marked as ‘1’ if the municipality was listed in an observatory survey; if not, it was left empty. Additionally, we added another field, OBS_NUM, which would store the observatory number. If the municipality wasn’t part of any survey, this field was left empty. The list of observatories and surveyed municipalities can be found in the Table 2.
Next, we moved on to geolocation, which took place in four stages. Each stage depended heavily on the data quality and completeness obtained from the previous ones.
Method 1
This phase entailed a systematic alignment process for the municipalities. When municipality names were included in our dataset, an attempt was made to correlate them with names of municipalities identified as data collection locales. For enhanced precision, this alignment was conducted specific to each observatory. Given the potential for variance in nomenclature across sources, a fuzzy matching algorithm was employed. This method calculates the probability of two distinct names referencing the same entity. Upon identifying a probable match, we performed a visual verification for all matches and flagged the false positives to remove them.
The following code segment elucidates the methodology applied:
Code
# List of yearsyears <-1995:2014# Read all datasets and combineall_surveys_description <-map_df(years, function(year) { df <-read_dta(paste0("Données ROR/enter/", year, "/res_deb.dta"))# Convert all columns to character to ensure consistency df <- df %>%mutate_all(as.character)return(df)})# Extract unique combinations and list all the years they appeared inunique_combinations <- all_surveys_description %>%group_by(j0, j42, j4) %>%summarize(years =toString(unique(year)),obs_count =n()) %>%ungroup()# Harmonize the fields that contain municipality or village namesunique_combinations <- unique_combinations %>%mutate(clean_muni =clean_string(j42),clean_village =clean_string(j4))obs_communes <- obs_communes %>%mutate(clean_ADM3 =clean_string(ADM3_EN))pop_places <- pop_places %>%mutate(clean_pname =clean_string(PLACE_NAME)) # List of observatories for which municipalities have been identifiedidentified_observatories <-unique(obs_communes$OBS_CODE) %>%na.omit()# Filter for the observatory for which we already have a manual identification# of municipalitiesunique_combinations <- unique_combinations %>%filter(j0 %in% identified_observatories) # Apply the fuzzy matching observatory-wiseresults <-map2_df(unique_combinations$clean_muni, unique_combinations$j0, ~as.data.frame(t(fuzzy_match(.x, obs_communes, "clean_ADM3", .y)))) %>%unnest(cols =c(matched_string, distance, ADM3_PCODE, ADM3_EN))# Combine the results with the unique_combinationscorrespondence_table <-bind_cols(unique_combinations, results) # Add a column for the matching methodcorrespondence_table <- correspondence_table %>%mutate(method =ifelse(!is.na(matched_string), "method_1", NA_character_))
To synthesize and visually represent our matches, the identified municipalities are plotted on a map:
Method 2: Extracting municipality names from text in “village names”
For observations where the field “municipality name” was empty (all cases before 2004 and still a frequent situation afterwards), we turned our attention to the village name field. The objective was to decipher whether these village names could potentially contain a municipality’s name that had been previously pinpointed during the observatory surveys. For the observations with no municipality value, we teased out the first word or segment in the village name. This extracted word then underwent a process of fuzzy matching against the roster of identified municipalities’ names. However, manual verification spotted errors. Certain matches, which we labeled ‘false positives’, emerged as errors and were purged from the results. Once this cleansing step was concluded, we included the validated matches back into the primary correspondence table.
Code
# Filter out matched municipalities and extract the first wordunmatched_results <- correspondence_table %>%filter(is.na(matched_string)) %>%select(j0:clean_village) %>%mutate(first_word =str_extract(clean_village, "^[^\\s/]+"))# Fuzzy matching with the first word and identified municipalitiesresults_step2 <-map2_df(unmatched_results$first_word, unmatched_results$j0, ~as.data.frame(t(fuzzy_match(.x, obs_communes, "clean_ADM3", .y)))) %>%unnest(cols =c(matched_string, distance, ADM3_PCODE, ADM3_EN))# Update Resultspotential_matches2 <-bind_cols(unmatched_results, results_step2)# Manually identify false positives and remove themfalse_positives <-c("madiromionga", "maroarla", "tsaratanteraka", "ambatoharanana", "ambatoaranana", "maroala", "erakoja", "erakoka", "maroalo", "erakka", "erakoa")validated_matches2 <- potential_matches2 %>%mutate(across(c(matched_string, ADM3_PCODE, ADM3_EN, distance),~ifelse(first_word %in% false_positives, NA, .)),method =ifelse(!is.na(matched_string), "method_2", NA_character_))# Integrate new results in correspondence tablecorrespondence_table <- correspondence_table %>%filter(!is.na(matched_string)) %>%bind_rows(validated_matches2) %>%select(-first_word)
Method 3: Village name fuzzy matching with populated places
With some observations still devoid of a matched municipality, we initiated another processing layer. This phase saw the unmatched village names being subjected to fuzzy matching against the populated places dataset described above. We first sifted out the unmatched results from our prior analysis. Then, we augmented the ‘pop_places’ dataset with observatory codes for the entries that were located in communes recognized as surveyed by observatories. Next we performed another fuzzy matching, comparing the village names against the ‘pop_places’ names. To maintain a level of precision, we set a restrictive threshold, dismissing any match that exceeded a certain “distance” or degree of difference. But, as with the previous methods, visual verification revealed some missmatches. These ‘false positives’ were flagged and discarded and the remaining ones were integrated into the overarching correspondence table.
For the remaining, let’s try matching with other village names for which municipality has been matched.
Even after the above measures, there remained observations that had eluded a municipality match. The subsequent strategy was to juxtapose them with other village names that had already been successfully matched to a municipality. We assembled a dataset comprising of municipality names and their paired village names, using data from previous successful matches. UThen the still-unmatched village names were run through a fuzzy matching process against the known village names. We fine-tuned the algorithm with a restrictive matching threshold. Again, some matches stood out as anomalies. Labeled as ‘false positives’, these were sifted out and the newly matched data was incorporated into the main correspondence table.
Code
# Create a list of municipality names and village names for matched observationsmatched_villages <- correspondence_table %>%filter(!is.na(method)) %>%select(j0, ADM3_PCODE, ADM3_EN, clean_village) %>%distinct() %>%rename(OBS_CODE = j0)# Re-filter unmatched resultsunmatched_results3 <- correspondence_table %>%filter(is.na(method)) %>%select(j0:clean_village)# Try matching unmatched villages against matched village names observatory-wiseresults_village_match <-map2_df( unmatched_results3$clean_village, unmatched_results3$j0,~as.data.frame(t(fuzzy_match(.x, matched_villages, "clean_village", .y, max_distance =0.28)))) %>%unnest(cols =c(matched_string, distance, ADM3_PCODE, ADM3_EN))# Bind these results with unmatched_results_v3potential_matches4 <-bind_cols(unmatched_results3, results_village_match)# Manually identify false positives and remove themfalse_positives4 <-c("amp0mbibitika antanakova", "arakoke ambonano ampihamibe", "farara farara", "farara ambakela", "fara ambakela", "ambatotelo marofinaritra")validated_matches4 <- potential_matches4 %>%mutate(across(c(matched_string, ADM3_PCODE, ADM3_EN, distance),~ifelse(clean_village %in% false_positives4, NA, .)),method =ifelse(!is.na(matched_string), "method_4", NA_character_))# Merge the updated results back to the correspondence tablecorrespondence_table <- correspondence_table %>%filter(!is.na(matched_string)) %>%bind_rows(validated_matches4)
Matching villages
After successfully matching commune names with an official, normalized reference, our next goal is to align the villages as identified by the ROR with their administrative counterpart, known as Fokontany. This task presents challenges, notably because the ROR’s village identifications weren’t directly linked to Fokontany names, and the Fokontany names themselves have undergone changes. Additionally, variations in village naming within the ROR dataset complicate the process. Our approach involves several steps: First, we load the official COD Fokontany dataset. Next, we standardize the place names using our previously discussed clean_string function. Then, using the purpose-built fuzzy_match_village function, we assess the similarity between the village names in the ROR dataset and the Fokontany names within the same commune, specifically by calculating the Jaro-Winkler distance between the strings. We the names with the highest degree of matching. During a manual review of these results, we closely examined the matches, especially those with larger Jaro-Winkler distance. From this examination, we derived a distance of 0.2, below which the majority of matches were accurate. Notably, some matches above this threshold were deemed accurate and retained, while a small number of matches below this threshold were identified as false positives and subsequently discarded.
Code
# Load the ADM4 dataset, which corresponds to Fokontanyfokontany <-st_read("../Entités administratives/OCHA_BNGRC frontières administratives/mdg_admbnda_adm4_BNGRC_OCHA_20181031.shp",quiet =TRUE) %>%mutate(ADM4_clean =clean_string(ADM4_EN))# A function to perform fuzzy matching with Fokontany or populated placesfuzzy_match_village <-function(target_string, dataframe, column_name, ADM3_PCODE_filter, id_column, municipality_string =NULL, max_distance =0.4) {# Check if target_string has multiple words and if the first word matches the municipality_stringif (!is.null(municipality_string) &&length(unlist(strsplit(target_string, " "))) >1) { first_word <-unlist(strsplit(target_string, " "))[1] distances_municipality <- stringdist::stringdist(first_word, municipality_string, method ="jw")if (min(distances_municipality, na.rm =TRUE) <= max_distance) { target_string <-str_remove(target_string, paste0("^", first_word, " ")) } } filtered_reference <- dataframe %>%filter(ADM3_PCODE == ADM3_PCODE_filter) %>%select(all_of(column_name), all_of(id_column))# If filtered_reference is empty, return NA valuesif (nrow(filtered_reference) ==0) {return(list(matched_string =NA, ID =NA, geometry =NA, distance =1)) }# Use stringdist to find the closest match distances <- stringdist::stringdistmatrix( target_string, filtered_reference[[column_name]], method ="jw")# If there are no valid distances, set min_distance to Infif(all(is.na(distances))) { min_distance <-Inf } else { min_distance <-min(distances, na.rm =TRUE) }# Check for Inf distance and replace it with 1if (is.infinite(min_distance)) { min_distance <-1 }# If min_distance exceeds the max_distance threshold, return NA valuesif (min_distance > max_distance) {return(list(matched_string =NA, ID =NA, geometry =NA, distance =NA)) } matched_row <- filtered_reference[which.min(distances), ]return(list(matched_string = matched_row[[column_name]], ID = matched_row[[id_column]], distance = min_distance))}# correspondence table without municipality geometrycorrespondence_table2 <- correspondence_table %>%st_drop_geometry()# For fokontanyresults_fokontany <-pmap_df(list(clean_village = correspondence_table2$clean_village, ADM3_PCODE = correspondence_table2$ADM3_PCODE, municipality_string = correspondence_table2$clean_muni),function(clean_village, ADM3_PCODE, municipality_string) {as.data.frame(t(fuzzy_match_village(clean_village, fokontany, "ADM4_clean", ADM3_PCODE, "ADM4_PCODE", municipality_string))) }) %>%unnest(cols =everything()) %>%rename(ADM4_PCODE = ID, ADM4_clean = matched_string, distance2 = distance) %>%st_drop_geometry() %>%select(-geometry)# Combine the results with correspondence_tablecorrespondence_table2_updated <-bind_cols(correspondence_table2, results_fokontany) %>%mutate(method_village =ifelse(!is.na(ADM4_PCODE), "method_5", NA_character_))# For a visual validationcorrespondence_table2_updated %>%arrange(desc(distance2)) %>%write_xlsx("correspondence_table2_update.xlsx")# After instpecting the data, we find that a 0.2 threshold in distance is # appropriate. However some valid matches above this threshold should be kepthigh_tolerance_valid_matches <-c("ambanja", "ambararata 2", "ambaro", "ambatoharanana", "ambatomanga", "ambazoamazava", "ambazoamirafy", "ambodibonara", "ambodimotso atsimo", "ambohidrony", "ambohimahatsinjo atsinanana", "ambongabe", "amparafaravola", "ampasy", "ampijoroa", "ampijoroan ala", "ampitana", "analambarika", "andohasoamahainty", "anjiamangirana 1", "ankazoabo anivo", "ankerereake", "ankililoaka 2", "bepako", "bevato", "erada 2", "esanta fototra", "esanta marofoty", "esanta maromainty", "feramanga atsimo", "habohabo atsimo", "habohabo avaratra", "lafiatsinanana", "madiromiongana", "manambaro 2", "manantenina", "mangabe", "mangaoka", "maritampona", "maroala", "marofoty", "marolampy", "marovantaza", "miary", "miary ankasy", "miary ankoronga", "miorimivalana", "tanambao", "tanambao 2", "tandroka andrefana", "tsarahasina", "vohilengo")# Below this threshold, a few false positives should be removedlow_tolerance_invalid_marches <-c("AMBOHIMAHATSINJO EST", "AMBOHIMAHATSINJO-EST/CENTRE", "AMBALAKINDRESY/ANDOHARENINA", "ambazoa /ampiha", "AMBALAKINDRESY/ANDOHARENA", "ANKAZOABO SUD/ALAMBARIKE", "ANKARINEZAKA/NAMARINA", "AMBALA/ANDRANOLAVA", "AMBALAKINDRESY/ANDRANOLAVA", "AMBALAKINDRESY/ANDRANOLAVA")correspondence_table3 <- correspondence_table2_updated %>%mutate(across(c(ADM4_clean, ADM4_PCODE, distance2, method_village), ~case_when( (distance2 >0.2&!ADM4_clean %in% high_tolerance_valid_matches) | ADM4_clean %in% low_tolerance_invalid_marches ~NA,TRUE~ .)))
Validating the Quality of Georeferenced Data
To ensure the robustness and validity of our georeferenced data, we employ a multi-tiered validation approach that hinges on quantitative metrics and qualitative consistency checks.
Quantitative metrics
We first look at the distribution of the unique observations that were successfully matched during the various stages of the georeferencing process.
Table 1: Number of ROR observation which commune and village were matched
Result of matching process
Number of survey observation per method
Commune matching method
Matched villages
Matched (%)
Unmatched villages
Unmatched (%)
Total
Total (%)
Method 1
19341
43.82
4054
9.18
23395
53.00
Method 2
8114
18.38
771
1.75
8885
20.13
Method 3
5839
13.23
1464
3.32
7303
16.55
Method 4
1974
4.47
1359
3.08
3333
7.55
Unmatched communes
0
0.00
1222
2.77
1222
2.77
Total
35268
79.90
8870
20.10
44138
100.00
Consistency check with documentation
To ensure the comprehensiveness of our georeferencing process, we summarize the list of unique matches for villages from all the municipalities in the final dataset. This check will help ensure there are no glaring gaps in the matched data.
Code
fokontany_list <- correspondence_table3 %>%mutate(j0 =as.numeric(j0)) %>%left_join(observatory_names, by =c("j0"="code")) %>%rename(OBS_NAME = name) %>%left_join(select(st_drop_geometry(fokontany), ADM4_PCODE, ADM4_EN),by ="ADM4_PCODE") %>%group_by(OBS_NAME, ADM3_EN) %>%summarize(ADM4_EN =list(unique(if_else(is.na(ADM4_EN), "Unknown", ADM4_EN)))) %>%ungroup() %>%mutate(ADM4_EN =map_chr(ADM4_EN, ~str_c(.x, collapse =", ")))fokontany_list %>%gt() %>%# Set column labelscols_label(OBS_NAME ="Observatory Name",ADM3_EN ="Commune Name",ADM4_EN ="Fokontany Names") %>%# Add a titletab_header(title ="Summary of Fokontany List",subtitle ="Breakdown by Observatory and Commune")
Table 2: List of fokontany per commune and observatory
Unknown, Amboditononona, Tanambao I, Andapabe, Ambatomitrozona
Fenerive Est
Miorimivalana
Miorimivalana, Unknown
Fenerive Est
Vohilengo
Vohilengo, Vohipenohely, Unknown
Fenerive Est
NA
Unknown
Marovoay
Ankazomborona
Madiromiongana, Unknown
Marovoay
Antanimasaka
Maroala, Ampijoroa
Marovoay
Marovoay
Unknown, Ampijoroan'ala
Marovoay
Tsararano
Unknown, Bepako
Marovoay
NA
Unknown
Tolanaro
Ampasy Nahampoana
Unknown, Ampasy Centre
Tolanaro
Enakara-Haut
Unknown, Enakara Haut
Tolanaro
Fort-Dauphin
Ambinanibe, Unknown
Tolanaro
Mahatalaky
Unknown
Tolanaro
Manambaro
Manambaro I, Unknown, Manambaro II
Tolanaro
Manantenina
Manantenina Haut, Unknown
Tolanaro
Mandromondromotra
Mandromondromotra, Hovatraha, Unknown
Tolanaro
Ranopiso
Ranopiso, Unknown
Tolanaro
Soanierana
Lafiatsinanana, Unknown, Andranara
Toliara littoral
Ankazoabo Sud
Analambarika, Unknown, Ankerereake, Atsimon-Dalana, Ankazoabo Anivo, Miary Centre, Tanambao II, Tandroka Ouest
Toliara littoral
Ankililoaka
Tanambao, Ambondro, Ampagnolora, Unknown, Ankililoaka II
Toliara littoral
Miary Ambohibola
Miary Ankoronga, Unknown, Miary Ankasy
Toliara littoral
NA
Unknown
We compare this with the ROR documentation to assess which villages might have been omitted by our algorithm.
Visual exploration and avenues for continuous corrections
We visually explore the the georeferenced data to observe its spatial distribution and spot eventual anomalies or clustering patterns that might not be immediately obvious from numerical summaries. The visualization below maps out our matched municipalities alongside villages. Discrepancies or clusters may indicate a need for further refinement or correction.
Code
commmune_pcode_years <- correspondence_table3 %>%# Split and unnest the yearsmutate(years =str_split(years, ", ")) %>%unnest(years) %>%# Convert years to numeric for proper sortingmutate(years =as.numeric(years)) %>%# Group by ADM3_PCODE and extract unique, sorted yearsgroup_by(ADM3_PCODE) %>%summarize(years =list(unique(years))) %>%ungroup() %>%# Sort and collapse the yearsmutate(years =map_chr(years, ~paste(sort(.x), collapse =", ")))fokontany_pcode_years <- correspondence_table3 %>%# Split and unnest the yearsmutate(years =str_split(years, ", ")) %>%unnest(years) %>%# Convert years to numeric for proper sortingmutate(years =as.numeric(years)) %>%# Group by ADM3_PCODE and extract unique, sorted yearsgroup_by(ADM4_PCODE) %>%summarize(years =list(unique(years))) %>%ungroup() %>%# Sort and collapse the yearsmutate(years =map_chr(years, ~paste(sort(.x), collapse =", ")))selected_communes <- commmune_pcode_years %>%filter(!is.na(ADM3_PCODE)) %>%left_join(obs_communes, by ="ADM3_PCODE") %>%st_sf()selected_fokontany <- fokontany_pcode_years %>%filter(!is.na(ADM4_PCODE)) %>%left_join(fokontany, by ="ADM4_PCODE") %>%st_sf() %>%mutate(label ="Surveyed______")Vahatra <-st_read("../Aires protégées/Vahatra/Shapefiles/AP_Vahatra.shp", quiet =TRUE) %>%st_make_valid() %>%mutate(label ="Protected area__")tmap_mode("view")tm_shape(selected_communes) +tm_fill(col ="OBS_NAME", palette ="Set1", title ="Observatory",id ="ADM3_EN",popup.vars =c("Data collection years"="years","District"="ADM2_EN","Region"="ADM1_EN")) +tm_shape(selected_fokontany) +tm_fill(col ="label", palette =c("black"), alpha =0.6, title ="Fokontany", id ="ADM4_EN",popup.vars =c("Data collection years"="years","Commune"="ADM3_EN","District"="ADM2_EN","Region"="ADM1_EN")) +tm_shape(Vahatra) +tm_fill(col ="label", palette =c("darkgreen"), alpha =0.4,popup.vars =c("Year creation"="an_crtn","IUCN category"="cat_icn","Manager"="gest_1","Manager 2"="gest_2"))
Figure 4: Map of the surveyed communes and fokonany in protected areas vicinity
Conclusion
Given the variability in location labels recorded in the ROR data, we emphasized standardizing and aligning toponyms. We introduced the clean_string function to unify place names, addressing variances in language, letter casing, and other descriptors. Utilizing fuzzy matching, we connected similar text strings, using the Levenshtein distance as a benchmark. To heighten accuracy, a hierarchical data matching strategy was applied, reflecting the inherent structure of geographical units, where regions envelop districts, which further consist of municipalities and localities. The COD data, curated by BNGRC and OCHA, was our reference point for these matches. Various techniques, including hierarchical fuzzy matching and visual checks, were employed.
As a result, 98% of commune name variations in the dataset matched with an official commune, and 73% of village name variations correlated with a fokontany. The outcomes are consistent with ROR’s internal documentation, and the spatial representations can be further inspected for validation. Future corrections and enhancements are anticipated, and our system, leveraging Quarto notebook and reproducible R code, is poised to facilitate these refinements.
ROR began its data collection at a time when geolocation best practices weren’t as evolved as they are today. Should more survey rounds be conducted in the future, the intricate georeferencing process demonstrated here should serve both as a cautionary tale and a guide on optimizing data registration (for instance, using PCODES). We’re optimistic that refining the georeferencing of past ROR data, as presented in this work, will amplify the data’s utility for both research and policy decisions.
References
Gubert, Flore, and Anne-Sophie Robilliard. 2008. “Risk and Schooling Decisions in Rural Madagascar: A Panel Data-Analysis.”Journal of African Economies 17 (2): 207–38. https://doi.org/10.1093/jae/ejm010.
Vaillant, Julia. 2013. “Attrition and Follow-Up Rules in Panel Surveys: Insights from a Tracking Experience in Madagascar.”Review of Income and Wealth 59 (3): 509–38. https://doi.org/10.1111/j.1475-4991.2012.00505.x.